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Chemically accurate and comprehensive studies of the virtual space of all possible molecules 
are severely limited by the computational cost of quantum chemistry. We introduce a composite 
strategy that adds machine learning corrections to computationally inexpensive approximate legacy 
quantum methods. After training, highly accurate predictions of enthalpies, free energies, entropies, 
and electron correlation energies are possible, for significantly larger molecular sets than used for 
training. For thermochemical properties of up to 16k constitutional isomers of C 7 H 10 O 2 we present 
numerical evidence that chemical accuracy can be reached. We also predict electron correlation 
energy in post Hartree-Fock methods, at the computational cost of Hartree-Fock, and we establish 
a qualitative relationship between molecular entropy and electron correlation. The transferability 
of our approach is demonstrated, using semi-empirical quantum chemistry and machine learning 
models trained on 1 and 10% of 134k organic molecules, to reproduce enthalpies of all remaining 
molecules at density functional theory level of accuracy. 


I. INTRODUCTION 

Designing new molecular materials is one of the key 
challenges in chemistry, and a major obstacle in solving 
many of the pressing issues that today’s society faces, 
such as clean and cheap water, advanced energy mate¬ 
rials, or novel drugs to fight antibiotic resistant bacte¬ 
ria. Unfortunately, the number of potentially interesting 
small molecules is too large for exhaustive screening p- 
3 . even when relying on automated synthesis and com¬ 
binatorial high-throughput “click-chemistry” HE]. Vir¬ 
tual screening strategies, made feasible by ever increas¬ 
ing compute power, advanced atomistic simulation soft¬ 
ware, and quantitative structure-property relationships 
have already helped in the discovery of new materials, 
and provided crucial guidance for subsequent experimen¬ 
tal characterization and fabrication EH33]. To achieve 
the overall goal of de novo in silico molecular and mate¬ 
rials design [THIS], however, substantial progress is still 
necessary [16], especially regarding prediction accuracy, 
computational speed, and transferability of the employed 
models. 

For quantum chemistry models to attain “chemical ac¬ 
curacy” (« 1 kcal/mol) in the prediction of covalent bind- 
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ing is crucial in many scientific domains. Examples in¬ 
clude the understanding of combustion processes HMS; 
questions relevant interstellar chemistry [20]; and predic¬ 
tion of reaction rates essential for catalysis. The lat¬ 
ter depend exponentially on energy differences, implying 
that small errors on the order of ksT propagate dra¬ 
matically. More generally, reaching chemical accuracy 
can be crucial for the detection of new structure prop¬ 
erty relationships, trends or patterns in Big Data, the 
design of new molecular materials with sensitive prop¬ 
erty requirements, or the energetics of competing reac¬ 
tants and products determining mechanisms and reaction 
rates. Control over the accuracy of important thermo¬ 
chemical properties of molecules can be achieved through 
application of well-established hierarchies in quantum 
chemistry m- Calibrated composite methods such as 
John Pople’s Gaussian model chemistry exploit the in¬ 
herent transferability of corrections to electronic correla¬ 
tion, the Born-Oppenheimer approximation, or basis-set 
deficiencies Ei Eg. This has enabled chemists to rou¬ 
tinely achieve chemical accuracy for any non-exotic and 
medium-sized organic molecule at substantial yet man¬ 
ageable computational costs [24, 25 . 

Unfortunately, such calculations are too demanding for 
the routine investigation of larger subsets of chemical 
space. Note, however, that the computationally most 
demanding task in a quantum chemistry calculation cor- 
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responds to an energy contribution that constitutes only 
a minor fraction of the total energy, while most of the 
relevant physics can already be accounted for through 
computationally very efficient approximate legacy quan¬ 
tum chemistry, such as the semi-empirical theory PM7, 
Hartree-Fock (HF), or even density functional theory 
(DFT). For the water molecule H 2 O, for example, HF 
approximates the experimental ionization potential by 
more than 90% [26\ Calculating the remaining A with 
chemical accuracy using correlated electronic structure 
methods requires a disproportionate amount of compu¬ 
tational effort due to unfavorable pre-factors and scaling 
with number of electrons. In this study, we introduce an 
alternative Ansatz to model the expensive A using a sta¬ 
tistical model trained on reference data requiring only a 
fraction of the computational cost. The observed speed¬ 
up, up to several orders of magnitude, is due to the com¬ 
putational efficiency of machine learning (ML) models. 
We have validated this idea for several molecular prop¬ 
erties, combining quantum chemistry results at several 
levels of theory with A-ML models trained over compre¬ 
hensive molecular data sets drawn from 134 kilo organic 
molecules published in Ref. [27j While the basic idea to 
augment approximate models with ML is not new na¬ 
ns, we present a generalized A-ML-model that achieves 
unprecedented chemical accuracy and transferability. 

We present numerical evidence for predicted atomiza¬ 
tion enthalpies, free energies, and electron correlation in 
many thousands of organic molecules (reaching molec¬ 
ular weights of up to 150 Dalton) with an accuracy of 
~1 kcal/mol at the computational cost of DFT or PM7. 
We validate the A-ML model for entirely new subsets 
of chemical space, up to two orders of magnitude larger 
than the set used for training. Using A-ML-based screen¬ 
ing, we find that within the constitutional isomers of 
C 7 H 10 O 2 , molecular entropy and electron correlation en¬ 
ergy of atomization are not entirely independent from 
each other. This suggests not only significant coupling 
between electronic and vibrational eigenstates but also 
the existence of Pareto fronts that can impose severe lim¬ 
itations with respect to simultaneous property optimiza¬ 
tion. Finally, we establish transferability by accurately 
predicting properties for a much larger molecular dataset 
comprising of 134k molecules. 


II. THE A-ML APPROACH 

The Ajymodel of a molecular property corresponds to 
a baseline ( b ) value plus a correction, towards a targetline 
(t) value, modeled statistically. More specifically, given 
a property P 5 ', such as the energy , for the relaxed 
geometry R 5 of a new query molecule, calculated using 
an approximate baseline level of theory, another related 
property P t , such as the enthalpy H t , corresponding to a 
more accurate and more demanding target level of theory 


can be estimated as 

N 

P t (Rt ) * a l(R b ) = P(,(R b ) + Y / aik(Rb,Ri). (1) 

i= 1 

The sum represents an ML-model, here a linear com¬ 
bination of Slater-type basis functions, k(Rb,Ri) = 
e -\Ri-R b \/cr^ cen tered on N training molecules, and 
with global hyperparameter a — the kernel’s width. The 
regression coefficients {a^} are obtained through ker¬ 
nel ridge regression, a regularized nonlinear regression 
model m that limits the norm of regression coefficients, 
thereby reducing overfitting and improving the transfer¬ 
ability of the model to new molecules. \Ri — Rb\ is a quan¬ 
titative measure of similarity between query molecule and 
training molecule i, using the Manhattan-norm (Pi) be¬ 
tween sorted Coulomb matrix representations [32, 133] . 
The latter uniquely encodes (except among enantiomers) 
the external potential of any given molecule in a way 
that is invariant with respect to molecular translation, 
rotation, or atom-indexing. As such P t of a new molecule, 
consistent with its minimum geometry R t at the target 
level of theory, is estimated using exclusively R 5 and P 6 ' 
as input. Thus, the A-model accounts for differences in 
(i) definition of property observable, e.g. energy en¬ 
thalpy, (ii) level of theory, e.g. PM7 —G4MP2, and (iii) 
changes in geometry (illustrated in [l]) . 
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FIG. 1. Two hypothetical property profiles connecting two 
constitutional isomers of C7H10O2. The A-model, Eq. 0, 
estimates the difference between baseline and targetline prop¬ 
erties (arrow) which differ in level of theory (b t), geometry 
(Rb Rt), and property (Eb —» H t ). 

As a first test of our Ansatz , we have trained A£ 
models for HOMO and LUMO eigenvalues calculated at 
various levels of theory [34] for the smallest 7k organic 
molecules in the GDB-dataset introduced by Reymond 
and coworkers [35] . After training on calculated data 
for lk molecules, the resulting ”lk A-model” reduces 
the mean absolute error (MAE) in the prediction of GW 
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HOMO eigenvalues for the remaining 6 k molecules from 
0.78 to 0.23 eV for the semi-empirical ZINDO baseline 
method. Interestingly, while the less empirical DFT hy¬ 
brid (PBEO) baseline method has an MAE of more than 2 
eV, this reduces to less than 0.1 eV when combined with 
the lk A-model. Correspondingly, MAEs for predicting 
GW LUMO eigenvalues reduce from 0.91 to 0.16 eV and 
from 1.3 to 0.13 eV for an d ^pbeo’ respectively. 

This observation suggests that more sophisticated base¬ 
line models, albeit occasionally leading to more substan¬ 
tial errors than simpler models, overall are smoother in 
chemical space, and therefore easier to learn. 



FIG. 2. Illustration of chemical diversity and data density of 
up to -100 molecules per kcal/mol of atomization enthalpy H 
(G4MP2 level of theory [241125] ). shown in ascending order for 
all 6k constitutional isomers of C 7 H 10 O 2 in the GDB-17 data¬ 
set [27l 135] , Seven near degenerate (within « 0.01 kcal/mol) 
molecules in the inset exemplify the chemical diversity. 


III. RESULTS AND DISCUSSION 

A. Chemically accurate prediction of covalent 
bonding 

To demonstrate that the A-ML model can reach chem¬ 
ical accuracy, we have investigated the covalent binding 
energies of the 6 k constitutional isomers of C 7 H 10 O 2 in 
the GDB [35]. Note that any other molecular set could 
have been used just as well. We relied on previously cal¬ 
culated highly accurate target level atomization energies 
(E t in |T]) at the G4MP2 level [25] for these isomers [271 . 
G4MP2 is widely considered to be on par with exper¬ 
imental uncertainties [36] . While structurally highly 
diverse, this data-set exhibits many near-degeneracies 
with high energy densities of up to ~100 molecules per 


kcal/mol in atomization enthalpy H ( inset of§. 

The A-ML model’s systematic improvement of accu¬ 
racy with increasing training set size is shown as a log-log- 
plot in [3] for the potential energy of atomization. Starting 
at different offsets, corresponding to the respective error 
of the pure baseline methods, the MAE, measured out-of- 
sample on the remaining molecules in the 6 k set, rapidly 
decreases. Note the constant decay rate for training set 
sizes larger than 1000 for all Ap 4MP2 -ML model errors. 
This suggests that for all models the error could be low¬ 
ered even further if more training data were used. 



FIG. 3. Mean absolute error (MAE) [kcal/mol] of ML pre¬ 
dicted atomization energies compared to G4MP2 reference 
values as a function of training set size N for out-of-sample 
predictions. The lines correspond to various baselines in the 
A^4M p 2_ moc i e i (Eq. |ij) MAE at N = 0 represents the 
baseline’s error. For comparison, the baseline-free ML model 
is shown as well (green); its iV=0 value is the standard de¬ 
viation in the G4MP2 atomization energies of the data set. 
The ’’chemical accuracy” target of 1 kcal/mol is highlighted 
in blue. A§/m? 2 (brown) and Ap^ P2 (pink) are variants of 
Apm 7 P2 (yellow) using reparameterized PM7 as baseline or 
an alternative molecular representation, respectively. 

While the error of the GGA DFT baseline model 
(ApBE P2 ) in [ 3 ] starts off slightly higher than the 
more accurate hybrid DFT analogue (Aggj^?), both 
rapidly converge to chemical accuracy (< 1 kcal/mol) 
for less than lk training molecules. Agg]^ 2 reaches 
^0.4 kcal/mol for a 5k training set. For the sub¬ 
stantially faster, yet more approximate PM7 baseline 
model (Ap^f P2 ) an accuracy similar to pure hybrid DFT 
(B3LYP) can be achieved with 2k training molecules, and 
for 5k training molecules the error has been quenched 
to less than 2 kcal/mol. All A-ML models outperform 
an ML model trained on the absolute value of E t di¬ 
rectly [32/33], without a baseline. 

We also considered the effect of reparameterizing the 
baseline method on the training set before applying A- 
ML. The arguably simplest model, Benson’s thermo¬ 
chemical bond additivity model (bound counting), has 
a prediction error of MAE« 100 kcal/mol for the 6k iso¬ 
mers. After reparameterization of all bond energies to 
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fit the training data, its MAE reduces to « 30 kcal/mol, 
which is worse than direct ML. Along the same line, we 
optimized all semi-empirical parameters in PM7 in or¬ 
der to reproduce G4MP2 atomization enthalpies of up to 
128 C 7 H 10 O 2 isomers, drawn at random. [ 3 ] (brown line) 
shows that when using this reparameterized (RPM7) 
baseline model in A§p ^ P2 it does allow for an improved 
off-set, however the advantage vanishes when increasing 
training set size beyond lk ©• Using the semi-empirical 
model OM2 m, instead of PM7, we found a similarly 
vanishing effect. These findings indicate the severe lim¬ 
itations inherent in fixed functional forms of electronic 
semi-empirical model Hamiltonians used in combination 
with globally optimized parameters—no matter the ac¬ 
tual combination of parameters. Statistically learned 
A-corrections, inferred from large numbers of example 
molecules, however, seem capable of capturing the more 
delicate energy contributions in the G4MP2 energy. We 
have also tested the effect of using an alternative molec¬ 
ular representation. The Ap ^ P2 model in [ 3 ] (pink line) 
shows the improvement of performance of the Ap ^ P2 
model when replacing the above mentioned Coulomb- 
matrix representation by the bag-of-bond descriptor re¬ 
cently introduced by one of us [38]. Encouragingly, also 
for this descriptor one observes similar decay rates, and 
an even better performance than for the Coulomb-matrix 
based A-model, reaching chemical accuracy for a training 
set size of 5k. 


B. Chemically accurate thermochemistry 

Prediction accuracy for thermochemical properties, 
such as internal energies, enthalpies, free energies and 
entropies of atomization at 298.15 K, all trained to re¬ 
produce G4MP2 target level of theory for the same set 
of 6 k constitutional isomers of C 7 H 10 O 2 were investi¬ 
gated. A-ML models have been trained for three base¬ 
lines, Ap^f P2 , Apgp P2 , and on subsets of vary¬ 

ing sizes. The baseline properties corresponded in this 
case simply to the potential energy of atomization, with 
the ML model accounting for differences in level of theory, 
in geometry, as well as for the respective thermodynamic 
effects. [I] lists resulting errors and standard deviations of 
predicted enthalpies of atomization at 298.15 K for var¬ 
ious trainingset sizes. As before, A-ML models display 
rapid error decay with increasing training set size. En¬ 
couragingly the standard deviation also decays rapidly 
with training set size. Again, the DFT baseline models 
yield MAEs of less than 1 kcal/mol already at lk train¬ 
ing set size, and the error of the 5k-Ag|J^ 2 -ML model 
remains below even after addition of the standard de¬ 
viation. The computationally less expensive PM7 base¬ 
line model performs slightly worse than the DFT based 
models. The lk-Ap^ P2 -ML model decreases the pure 
PM7 prediction error and standard deviation by more 
than ^50%, and converges to near chemical accuracy 
(1.7 kcal/mol) for a 5k training set. Computational ef¬ 


fort for out-of-sample predictions is dominated by base¬ 
line evaluations. For internal energies, free energies and 
entropies of atomization we have observed nearly iden¬ 
tical convergence and baseline trends. All these results 
indicate that the A-ML approach represents an inexpen¬ 
sive strategy to accurately estimate not only differences 
in potential energies due to different electronic structure 
models, but to also account for thermal contributions to 
thermodynamic state functions without having to calcu¬ 
late the corresponding partition functions. Note that the 
latter can be prohibitively expensive when using more 
accurate theories. 

TABLE I. Mean absolute errors 4= standard deviations 
for predicted out-of-sample enthalpies of atomization H 
(T— 298.15 K) at G4MP2 level of theory using the A^ 4MP2 - 
ML model for increasing training set size N in Eq. 0 . All 
values in kcal/mol. Training and test set sizes always add up 
to 6095 constitutional isomers of C 7 H 10 O 2 . 


N 

A G4MP2 AG4MP2 a G4MP2 
n PM7 ^PBE ^B3LYP 

0 

6.44=8.6 

3.0T4.1 

2.54=3.1 

0 . 1 k 

5.7T7.6 

2.2T2.9 

1.54=1.9 

lk 

3.94=4.1 

0 . 8 T 1.1 

0.7T0.9 

2 k 

2.4T3.1 

0 . 6 T 0.8 

0.6T0.7 

3k 

2 . 2 T 2.8 

0.5T0.7 

0.5T0.6 

4k 

1.9T2.4 

0.5T0.6 

0.4T0.6 

5k 

1.7T2.2 

0.5T0.6 

0.4T0.5 


C. Electron correlation 

To further assess the applicability of the A-ML Ansatz , 
we modeled electron correlation energies, essential for 
achieving chemical accuracy. [21] Within post-HF the¬ 
ory, electron correlation energy can be defined as the 
difference between converged basis-set HF potential en¬ 
ergy and its corresponding non-relativistic exact counter¬ 
part. [26 Evaluating the many-electron correlation en¬ 
ergy at the post-HF level of theory requires substantial 
computational effort. The computational complexity of 
the simplest post-HF method, second order perturbation 
theory (MP2), scales as , where N e is the number of 
electrons. The “gold standard” of quantum chemistry, 
CCSD(T), even scales as Nj. Revisiting the 6 k consti¬ 
tutional isomers of C 7 H 10 O 2 ([ 2 ]), we have calculated the 
difference in the correlation energy part of molecular at¬ 
omization energies for various correlated methods. 

For the 6 k isomers MAEs, after accounting for system¬ 
atic shifts, are shown in[4]for various combinations of HF, 
MP 2 , CCSD, and CCSD(T), along with the reduction of 
error due to lk-A£ models. Particularly noteworthy is 

the result for the A^[p SD( ^ model giving the total cor¬ 
relation energy (defined as difference between HF and 
the exact result as approximated by CCSD(T)) between 
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FIG. 4. Mean absolute error (MAE) [kcal/mol] of Eb (blue 
bars) and lk A{j-ML model predictions (red bars) of atomiza¬ 
tion energies for various combinations of increasingly corre¬ 
lated post-Hartree-Fock methods as target and baseline meth¬ 
ods. These values are obtained after subtracting the system¬ 
atic average shift that exists between methods for the 6k iso¬ 
mers. See Supplementary Information for details. 


the least and most expensive method: The MAE is re¬ 
duced from ^2.9 to less than 1 kcal/mol. Note that the 
lk-A^p 2 model has a larger MAE from MP2 than the 
MAE of the lk-A§p SD from CCSD, even though MP2 
is more approximate in nature than CCSD. As such, the 
MP2 target, albeit less accurate and more approximate 
than CCSD, appears to be a more complex function in 
chemical space. 

Remarkably, when using the lk-A^p| D or lk- 

^MP 2 D( ^ m °dels the MAE amounts in both cases to less 
than 0.5 kcal/mol. This suggests the possibility of the 
proverbial free lunch modeling the more accurate and 
computationally more expensive CCSD(T) level of the¬ 
ory with the same training set size and precision as the 
less accurate and less expensive CCSD method. The 
least approximate baseline, encoded by the lk-A^g^ 1 ^ 
model, yields a chemically nearly negligible MAE from 
CCSD(T), ^0.1 kcal/mol. For larger training set sizes we 
have observed correlation energy errors to decay similarly 
to the error of DFT baseline models for thermodynamic 
properties. 


D. Applicability: Diastereomers of C7H10O2 

We have tested the applicability of the lk-A-ML 
model, trained on lk out of the 6 k C 7 H 10 O 2 isomers in 
the GDB database, for the identification of the most sta¬ 
ble diastereomers that can be generated from the parent 
isomers. Such screening applications are highly relevant 
for spectroscopic or computational experiments aimed at 
the discovery and characterization of competing reaction 
pathways, recently discussed for an u ab initio nanoreac- 
tor” [39]. More specifically, we applied the lk 


model of atomization enthalpy at 298.15 K (|T|), to screen 
all the 9868 unique and stable diastereomers resulting 
from inversion of atomic stereocenters in the original 
GDB set of 6095 constitutional isomers of C 7 H 10 O 2 (see 
Methods section). For validation, we have randomly 
drawn 3k diastereomers and calculated their computa¬ 
tionally demanding G4MP2 enthalpies of atomization. 
The lk-Ag |]^> 2 model yields a MAE of 0.8 kcal/mol for 
these 3k diastereomers. We have chosen the DFT base¬ 
line for this exercise because cheaper baseline models, 
such as 5k Ap ^ P2 and Ap{^ P2 * Q, exhibit less trans¬ 
ferability when validated on the G4MP2 results for the 
3k diastereomers, namely MAEs of 3.5 and 2.8 kcal/mol, 
respectively. 

Out of all the 10 k diastereomers, the lk-Agg ^^ 2 model 
predicts 6-oxabicyclooctan-7-one, which is caprolactone 
with a methyl bridge between positions 1, and 5. with an 
estimated atomization enthalpy H of-1933.5 kcal/mol, to 
be the most stable isomer at ambient conditions. A val¬ 
idating G4MP2 calculation yielded the same number. [5] 
shows this molecule along with its ten enthalpically clos¬ 
est isomers. These span a narrow energetic window of 
9 kcal/mol, which is sparse in comparison to the afore¬ 
mentioned 100 molecules/kcal/mol energy density. The 
six isomers for which AH < 6 kcal/mol correspond to 
diastereomers of oxabicyclo[2.2.1]heptan-3-one, methy¬ 
lated at 1,4,5,5, 6 ,7 positions, respectively. Isomers 3 
(A H = 4.3 kcal/mol) and 4 (AH = 4.5 kcal/mol) dif¬ 
fer only by the chirality of the carbon atom at position 
1. The next four high-lying isomers, although populat¬ 
ing only a narrow AH range of 7-9 kcal/mol, exhibit 
very diverse chemical structures: They include a cy¬ 
clopentane fused with 7 -valerolactone, a methyl, ethyl- 
substituted furanone, a methylated cyclo hexanedione, 
and a cyclopentane fused with /3— propiolactone and 
methylated bridge atom framework. After identification 
of these isomers, we calculated validating G4MP2 en¬ 
thalpies ([ 5 ]). The lk-Ag |^^ 2 ML model estimates the 
isomerization enthalpy of these products with a maximal 
error of 0.6 kcal/mol for product 10. The ML-model pre¬ 
dictions agree with G4MP2 results calculated a posteriori 
, and never exceed the threshold of chemical accuracy (1 
kcal/mol). 

For comparison, [5] also features estimates obtained 
from the DFT baseline method B3LYP, which is pop¬ 
ular among many computational as well as experimental 
chemists. While B3LYP would have predicted the same 
global minimum, its reaction enthalpies can deviate sub¬ 
stantially, and, sometimes fail spectacularly (isomer 8 ). 
It is interesting to note that the ML-model is apparently 
capable to reduce or increase the estimate depending on 
its baseline overshooting (isomers 1 - 6 ) or underestimat¬ 
ing (isomers 7-9). Only in the case of isomer 10, use of the 
ML-model would deteriorate the baseline’s prediction er¬ 
ror, albeit only from 0.3 to 0.5 kcal/mol. We believe that 
such overall agreement of predicted reaction enthalpies 
with G4MP2 results obtained a posteriori strongly in¬ 
dicates that the A-ML Ansatz is capable to account for 
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subtle errors made in the prediction of competitive chem¬ 
ical bonding— at the baseline’s computational cost (in 
this case DFT). 


E. Interpretation of the A-Model: 

One can understand the trained corrections as follows: 
The ML model of atomization energies can be 

viewed as a ML model of the correlation energy of at¬ 
omization. Likewise, when using atomization energies as 
baseline properties for free energies, and enthalpies, the 
difference in the resulting ML models cancels the base¬ 
line energy and corresponds, after division by T, to the 
entropy of atomization 

(Af - Ag) /T = S. ( 2 ) 

Using a random lk subset of the 6 k C 7 H 10 O 2 isomers, 
we have trained two ML models, one on S of atomiza¬ 
tion at G4MP2 level of theory, taken as (H — G)/T from 
Ref. [27j the other on E c at CCSD(T) level of theory, 
also from Ref. 27. Computationally efficient PM7 equi¬ 
librium geometries have been used for training, testing, 
and predictions. 

We have reapplied the resulting lk models to screen 
the aforementioned 10 k diastereomers for those molecu¬ 
lar pairs which exhibit maximal isomerization entropies 
AS and correlation energies A E c . Structures of the 
molecules with extreme S and E c are shown in [ 6 j The 
molecular pair with maximal AS is consistent with 
chemical intuition: The lowest entropy isomer, 2,5- 
dioxatricyclononane, has a cage-like structure and is very 
compact, bearing some resemblance to adamantane, and 
void of any conformational degree of freedom. By con¬ 
trast, the molecule with largest entropy, 5-methoxyhex- 
3-ynal, possesses multiple conformational degrees of free¬ 
dom, made possible through the occurrence of a double 
and a triple bond that consume the valences otherwise 
accessible for ring or cage formation. The resulting AS 
estimated by the ML model deviates from the reference 
G4MP2 value by only 1 cal/mol/K. Quantitative ratio¬ 
nalization of the trend in E c is less obvious, owing to its 
origin in electronic many-body effects. However, E c can 
be intuitively related to the number of interacting elec¬ 
tron pairs. This number is small when the molecule is 
long since according to the nearsightedness principle 00 ], 
electrons localized on one end of the molecule interact less 
with those from the other end. In compact molecules by 
contrast, more electrons can interact, hence the num¬ 
ber of interacting electron pairs, and consequently the 
magnitude of E c , are large. After screening of diastere¬ 
omers using our E c model we found among 10k diastere¬ 
omers 6-methyl-2-oxatricycloheptan-6-ol and 5-methoxy- 
2-methylpent-3-ynal to have the maximal reaction elec¬ 
tron correlation energy, see [6j The maximal electron cor¬ 
relation energy difference, 24 kcal/mol, deviates from val¬ 
idating CCSD(T) reference results by 1 kcal/mol. Both 


molecules confirm intuition: The most compact molecule 
with few degrees of freedom exhibits maximal correlation 
while the most elongated corresponds to the least amount 
of correlation energy. 

In above discussion we notice that for both pairs with 
maximal difference in electron correlation and in en¬ 
tropy, respectively, similar observations hold: Compact¬ 
ness/extension appears to maximize the difference in 
both cases. This observation raises the question whether 
5, which arises from the vibrational partition function, 
and E c , due to electronic many-body effects, are inter¬ 
dependent. An underlying relationship could aid not 
only in pin-pointing molecules that pose interesting chal¬ 
lenges for benchmarking approximate electronic theories 
but might even lead to semi-quantitative estimations of 
E c via S. Thermal molecular properties, such as heat- 
capacities, could be linked directly to their electronic 
structure. To elucidate their potential relationship, [7] 
shows a scatter plot of the model-predicted atomiza¬ 
tion entropy and correlation energies of the 10 k diastere¬ 
omers, as well as the 6 k parent isomers of C 7 H 10 O 2 . Al¬ 
beit hardly quantitative, a qualitative interdependence 
is revealed, suggesting a molecular analogue to phonon- 
electron coupling phenomena in solids. 

Such a dependency might serve the construction of 
rough structure property relationships for the filtering 
of compounds using one property as a mutual descrip¬ 
tor for the other. Furthermore, this relationship could 
possibly impose severe constraints on how freely S and 
E c can be varied independently within multi-objective 
property optimization procedures in chemical compound 
space. To further illustrate this point, [7] also highlights 
corresponding Pareto fronts. Note, for example, that 
while [ 6 ] displays the pairs that maximize the vertical ( E c ) 
or horizontal ( S ) axis in[7| the molecular pair that simul¬ 
taneously maximizes both differs. Other molecules, such 
as bullvalene, also happen to fall onto the same linear 
relationship. However, for organic molecules with very 
different sizes, taken from the 134k GDB-9 dataset, this 
linear trend breaks down. As such, it might still require 
normalization by number of atoms or electrons to hold 
in general. 

We finally note that arriving at these observations 
exclusively via high-throughput ab initio computations 
would have required TVj-scaling G4MP2 calculations for 
all the 10 k diastereomers with an estimated need for com¬ 
pute time of ^20 CPU years. The PM7 baseline pre¬ 
dictions, by contrast, required only ~1 CPU day for all 
geometry relaxations, and the remaining deviation from 
target properties G4MP2-*? and CCSD(T )-E c is given 
instantaneously by the ML correction. 


F. Thermochemistry for 134 kilo organic molecules 

When dealing with hundreds of thousands of molecules 
one typically assumes that it is not necessary to achieve 
chemical accuracy for all of them. Instead, hierarchical 
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procedures where less accurate but computationally more 
efficient methods, such as DFT, filter out the most rele¬ 
vant compounds which subsequently can be studied us¬ 
ing more accurate and computationally more demanding 
methods, such as G4MP2. DFT calculations, however, 
are ordinarily too expensive to be used for filtering hun¬ 
dreds of thousands of molecules. We have investigated 
whether the A-approach can be used for DFT-quality 
filtering at the computational cost of a semi-empirical 
quantum chemistry calculation. Specifically, based on 
PM7 baselines we have predicted DFT targetline en¬ 
thalpies and entropies of atomization. To more system¬ 
atically assess transferability, we have trained a lk and 
10k training set drawn at random from the nearly 134k 
organic molecules containing up to nine C, N, O, or F 
atoms (published as GDB-9 in Ref. [27). For subsequent 
validation, we have used the remaining 133k and 124k 
molecules, respectively. 

On average, PM7 enthalpies of atomization deviate 
from B3LYP by 7.2 kcal/mol. For a randomly drawn 
training set of lk molecules, lk-APj^ p -ML predicts 
B3LYP enthalpies of the 133k additional (out-of-sample) 
molecules with an MAE of 4.8 kcal/mol. Increasing the 
number of training molecules to 10 k leads to an improved 
MAE of 3.0 kcal/mol, as measured for the remaining 124k 
out-of-sample molecules. We note that such a predictive 
accuracy places the 10 k-APj^ p -ML model on par with 
generalized gradient approximated (GGA) or even hy¬ 
brid DFT [36j [41] —at the computational cost of PM7. [ 8 ] 
features the corresponding scatter plot of actual versus 
predicted B3LYP enthalpies of atomization. The lower 
right inset shows that the baseline’s systematic underes¬ 
timation, as well as its skew, has been removed already 
by the lk-AP^^ P -ML model. The error distribution con¬ 
tracts further as the training set size is increased to 10 k. 
The upper left inset scatter plot illustrates the impor¬ 
tance of having a baseline, the pure ML contribution be¬ 
ing far from perfect correlation. The molecular structure 
on is the most extreme outlier, PM7 underestimates its 
stability by 86.0 kcal/mol. Encouragingly, lk and 10k- 
APj^ p -ML models reduce the error for this outlier to 
73.9 and 58.0 kcal/mol, respectively. 

We have also analyzed the effect of molecular geom¬ 
etry. It is well known that the faithfulness of common 
quantum chemical methods can alter drastically when 
changing the geometry of the molecule. Stretching chem¬ 
ical bonds, for instance, can lead to severe errors, even 
for methods that predict the energy minimum perfectly 
well. To systematically assess the effect of geometry, we 
compare the predicted B3LYP atomization enthalpies for 
PM7 and 10 k-APj^ p for all 134k molecules, as a func¬ 
tion of their normalized principal moments of inertia. [9] 
displays the resulting deviation from B3LYP, spanned 
by molecular geometry (rod, disk, or sphere-like). While 
PM7 has particularly strong deviations (~ 20 kcal/mol) 
on the linear to planar edge, as well as close to the lower 
part of the linear to spherical edge, use of the ML correc¬ 
tion homogeneously quenches the error throughout the 


triangle into the 5 kcal/mol error window, with very few 
20 kcal/mol outliers persisting on the rod-disk edge. Note 
that due to the non-uniqueness of the moments of in¬ 
ertia, error heatmaps in [9] of many molecules superim¬ 
pose each other in increasing order. To avoid possible 
mis-interpretations, the inset with a heat-map of data 
density provides a means to visually normalize the error 
heatmaps. 

Regarding the computational speed-up, we note that 
on a typical CPU, a single AP^^ P -ML evaluation re¬ 
quires no more than 10 seconds for the largest molecule 
in GDB-9. Thus, screening of the entire set of 134k 
molecules has consumed less than 2 CPU weeks. By 
contrast, the average computational cost for obtaining a 
B3LYP atomization enthalpy amounts to roughly 1 CPU 
hour per molecule, implying 15 CPU years for DFT based 
screening of the 134k molecules. 


G. Conclusions 

We have introduced a composite quantum chem¬ 
istry/machine learning approach. It combines approx¬ 
imate but fast legacy quantum chemical approxima¬ 
tions with modern big data-based machine learning es¬ 
timates trained on expensive and accurate reference re¬ 
sults throughout chemical space. We have shown that the 
A-ML model can be used to study other, out-of-sample 
molecules, not part of training. Effectively one can reach 
the accuracy of high-level quantum chemistry at a dra¬ 
matically lower computational burden which is domi¬ 
nated by the employed baseline method, such as semi- 
empirical quantum-chemistry (PM7), HF, or DFT. Mere 
reparameterization of the baseline method’s global pa¬ 
rameters for a given training set does not suffice, yielding 
measurable advantage only for very small and selected 
training and test sets. Alternative molecular representa¬ 
tions, however, could still lead to faster convergence to 
chemical accuracy. Similar learning rates with respect to 
training set size among all model-combinations, merely 
differing by off-set, suggest that even very approximate 
and computationally inexpensive baseline models can be 
used, provided access to sufficiently large training sets. 
For chemically diverse sets of organic molecules we have 
presented numerical evidence that chemically accurate 
molecular thermochemistry predictions can be made at 
a computational cost reduced by several orders of mag¬ 
nitude when compared to the reference method for new 
out-of-sample molecules. 

For the most stable isomer in the set of 10k diastere- 
omers generated from all 6 k molecules with C 7 H 10 O 2 sto¬ 
ichiometry in GDB-17 [35], we have demonstrated how 
to identify the ten most competitive reaction isomers. 
For the same diastereomers we also identified a quali¬ 
tative dependency between entropy and correlation en¬ 
ergy of atomization, suggesting a molecular equivalent 
of electron-phonon coupling. Finally, we have presented 
evidence for the transferability of the A-ML model by 


reducing the error of semi-empirical quantum chemistry 
method from 7.2 kcal/mol to the error of generalized 
gradient approximated (~ 5 kcal/mol) or hybrid density 
functional theory 3 kcal/mol) for over hundred thou¬ 
sand organic molecules using less than 1 and 10 % of them 
for training, respectively. 

We believe the high predictive accuracy to be due 
to the fact that approximate theories already capture 
the most important contributions to chemical energet¬ 
ics. The remaining deviations from the reference results 
are typically smaller, possibly also smoother, and prove 
to be more amenable to statistically trained ML mod¬ 
els. Overall, our results suggest that the A-ML-model 
represents an attractive strategy for augmenting legacy 
quantum chemistry with modern big data driven ML 
models. For future studies, this strategy might also offer 
substantial improvements to predictive accuracy of other 
properties such as heat capacities, non-adiabatic energy 
corrections, barriers of elementary reaction steps, optical 
properties, atomic forces for molecular dynamics calcu¬ 
lations, molecule specific parameters for semi-empirical 
theories, or electronic excitations. 

IV. METHODS 
A. Molecular datasets 

We have considered four sets of organic molecules. The 
first set has been used for preliminary testing of the 
Ansatz , and consists of the 7211 (7k) organic molecules 
and HOMO/LUMO eigenvalues and molecular polar¬ 
izabilities at different levels of theory as published in 
Ref. l34l The second set contains 133885 (134k) molecules 
with up to 9 heavy atoms (C, O, N, F, not counting H) 
in the universe of small organic molecules “GDB” [35] for 
which we calculated and published semi-empirical (PM7) 
and density functional theory (B3LYP)-based thermo¬ 
chemical properties such as enthalpies and entropies of 
atomization m- The diversity of this set is shown in 
[9] We note at this point that in A-ML models other 
baseline methods, such as extended Hiickel, tight-binding 
DFT [42] . OM 2 [371 , or AM05 [43] could have been used 
just as well. The third set corresponds to a subset of 
the second set: For 6095 ( 6 k) constitutional isomers of 
C 7 H 10 O 2 we calculated the same thermochemical proper¬ 
ties at significantly more sophisticated and computation¬ 
ally demanding level of theory, widely considered to be 
of “chemical accuracy” (~1 kcal/mol). Also this set has 
been published in Ref. [2Zl Finally, the versatility of this 
method is assessed for a fourth set of molecules, consist¬ 
ing of 9868 (10k) stable diastereomers that are not part 
of the GDB universe, and have been obtained by invert¬ 
ing all atomic stereocenters in the aforementioned third 
set of 6 k C 7 H 1002 -isomers. This dataset is a part of this 
publication, and is available on the authors’ homepage. 


B. Computational details 

From Ref. [35] we obtained all SMILES [44] strings 
for molecules with up to nine heavy atoms. We then 
excluded cations, anions, and molecules containing S, 
Br, Cl, or I, arriving at 133885 molecules. This data 
is presented and analyzed in more depth in Ref. [27] 
Cartesian coordinates for the subset of 6095 isomers of 
C 7 H 10 O 2 were determined by parsing the corresponding 
SMILES strings using Openbabel software 05], followed 
by a consistency check using the CORINA code [46] . 
Structures of 9868 non-enantiomeric stable diastereomers 
were obtained through inversions of chiral C atoms in 
the SMILES strings followed by conversion to Carte¬ 
sian coordinates using CORINA. To verify that all the¬ 
oretical methods preserved topology and chirality, we 
transformed the Cartesian coordinates back to SMILES, 
and InChl strings using Openbabel. Using these ini¬ 
tial structures, we carried out geometry relaxations 
at the PM7 [47] semi-empirical level of theory using 
MOPAC [48]. We used the PM7 equilibrium coor¬ 
dinates as initial geometries and performed DFT and 
G4MP2 geometry calculations using Gaussian09 [49]. For 
DFT calculations, we employed the Gaussian basis set 6 - 
31G(2df,p) which is also used in the G4MP2 calculations 
in combination with the DFT method B3LYP [50] . for 
geometry relaxation and frequency computations. We 
used the same basis set also in the GGA-PBE m cal¬ 
culations. G4MP2 employs harmonic oscillator and rigid 
rotor approximations to estimate the entropy of nuclear 
degrees of freedom [25]. At all levels of theory, we per¬ 
formed harmonic vibrational analysis for all molecules to 
confirm that the predicted equilibrium structures were 
local minima on the potential energy surface. HF, MP2, 
CCSD, CCSD(T) energies have been computed with the 
basis set 6-31 G(d) as a part of G4MP2. Further technical 
details regarding all quantum chemistry data, including 
convergence thresholds employed, are given in Ref. [27] 
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FIG. 5. Calculated reaction enthalpies at 298.15 K between the most stable molecule with C 7 H 10 O 2 stoichiometry (6- 
oxabicyclooctan-7-one, in inset, with atomization enthalpy -1933 kcal/mol), and its ten energetically closest isomers in increasing 
order, according to targetline method G4MP2 calculated a posteriori (black). Ik A§ 3 lyp ML model predictions are given in 
blue. Baseline method B3LYP predictions are shown for comparison (red). 



FIG. 6. Molecular pairs with maximal reaction entropy (top) 
and electron correlation energy (bottom) out of the 10k stable 
diastereomers of C7H10O2. Reaction properties have been 
estimated using a lk-Agl^^-ML model for the entropy and 
a lk-A^p SD( ' T ' ) -ML model for the correlation energy (black). 
Corresponding reference values calculated for validation are 
given in parentheses (blue). 
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FIG. 7. Scatter plot of ML-model predicted entropy of atom¬ 
ization at T — 298.15 K versus ML-model predicted electron 
correlation contribution to atomization energy for the 16k sta¬ 
ble out-of-sample diastereomers of C 7 H 10 O 2 . All predictions 
are made using ML models trained on lk molecules randomly 
drawn from set of 6k parent isomers. Training data is shown 
in red. Linear regression yields E c & 0.445 x TS — 175.211 
[kcal/mol]. Pareto fronts are indicated by convex hull shown 
in gray. 
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FIG. 8. Scatter plot of predicted atomization enthalpies H 
at 298.15 K for 124k out-of-sample GDB-9 molecules [35]. Es¬ 
timated H is plotted for PM7 (yellow, MAE = 7.2 kcal/mol) 
and lk (blue) and 10k (red) AP^/ P -ML models (MAE=4.8 
and 3.0 kcal/mol, respectively) versus reference B3LYP val¬ 
ues. The left side inset shows the ML contribution to the esti¬ 
mated energy differences between PM7 and reference B3LYP 
enthalpies, A est , for lk (blue) and 10k (red) models versus ref¬ 
erence difference, A ref . The right side inset shows the error 
distribution around B3LYP enthalpies for PM7, lk, and 10k 
models, respectively. The most extreme outlier (top, right), 
7-amino-3-oxatricycloheptan-4-one, has error 86, 74, and 58 
kcal/mol for PM7, lk, and 10k models, respectively. 
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FIG. 9. Shape distribution of color-coded absolute deviations between predicted and B3LYP reference atomization enthalpies 
of the 134k organic molecules with up to nine atoms (not counting hydrogens) in the GDB-17 data set [35]. Vertical and 
horizontal axes correspond to normalized principal moments of inertia /1//3 and I 2 /I 3 , respectively, with /1 < I 2 < h- (a) 
Large PM7 errors (> 25 kcal/mol) are predominantly present for geometries on the rod-disk edge, and in the center of the 
triangle. Corners indicate the geometrical shape of molecules with molecular drawings corresponding to examples of linear 
(ii = 0, h — I 3 ), planar (2/i — 2/2 — I 3 ), and spherical (I± — I 2 — I 3 ) cases, (b) 10k-APj^ p -ML errors are systematically 
smaller, some outliers at the rod-disk edge persist. A heatmap of the molecular data density is shown for the same coordinate 
system in the inset below (b). 













